I.  Parallel Computer System Design Evaluation and Simulation Environment

One of the major activities of CSRD is evaluating various multiprocessor architectures.
Simulation is the only way of predicting the performance of, and comparing, different
multiprocessor organizations. The problem is that such simulations take enormous amounts of
computer time even when the system model is not very detailed. This is not surprising, as even
uniprocessor simulations using large realistic benchmarks can be extremely time consuming.
Multiprocessor simulation complexity grows at least as fast as the number of processors in the
system, and in reality quite a bit faster because of interaction between processors. Parallel
simulation is one approach to increasing the speed of simulations of computer systems, but it
requires a parallel system to execute the simulator in order to achieve the speedup. We have
used additional CEs purchased from Alliant to increase the amount of hardware parallelism and
so improve performance.

The two main parallel simulation activities at CSRD are 1) simulating a class of systems
represented by Cedar, and 2) developing a parallel execution system for a high-level system
description language. The first is a special-purpose parallel simulator that can be
reconfigured to model only one class of systems. The second is part of an environment project
whose goals are ease of description, model reuse, and fast evaluation of different
architectures. In both cases the underlying idea is to use a register-transfer level
description of a system and exploit the parallelism available in hardware at this level. Both
systems are hierarchical, however, and do not require everything to be modeled at this level.

The special-purpose simulator has been used to simulate shared-memory multiprocessor
systems with an interleaved memory system and two unidirectional interconnection networks. We
have modeled the effect of various parameters for a range of system sizes, concentrating on
32 to 512 processors. Both synthetic and trace-driven workloads have been used and the
performance of the two models correlated. System parameters varied include network
organization, memory interleaving and speed, data path and message length, burst traffic
characteristics, conflict resolution, etc. [Turn89]. We are in the process of verifying the
simulator by modeling the current Cedar configuration and comparing the results with actual
measurements.
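To make the memory-interleaving parameter concrete, the following is a minimal sketch of how a synthetic workload might be used to count bank conflicts. It is not the CSRD simulator: the low-order interleaving rule, the one-request-per-bank-per-cycle service model, the uniform random addresses, and all function names are assumptions introduced only for illustration.

```python
import random
from collections import Counter


def bank_of(address, num_banks):
    """Assumed low-order interleaving: consecutive addresses hit consecutive banks."""
    return address % num_banks


def conflicts_in_cycle(addresses, num_banks):
    """Count requests that must wait in one cycle.

    Assumed service model: each bank serves one request per cycle, so every
    additional request to the same bank in that cycle is a conflict.
    """
    per_bank = Counter(bank_of(a, num_banks) for a in addresses)
    return sum(count - 1 for count in per_bank.values())


def average_conflicts(num_processors, num_banks, cycles, seed=0):
    """Synthetic workload: each processor issues one uniform random address per cycle."""
    rng = random.Random(seed)
    total = 0
    for _ in range(cycles):
        addresses = [rng.randrange(1 << 20) for _ in range(num_processors)]
        total += conflicts_in_cycle(addresses, num_banks)
    return total / cycles
```

Sweeping `num_processors` over the 32 to 512 range while varying `num_banks` gives a crude picture of how contention grows with system size; a trace-driven run would replace the random addresses with recorded ones.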

The general-purpose parallel simulator is based on the ideas and experience derived from
the special-purpose case. We have developed a prototype execution system and are evaluating
its performance. One of the main questions we are trying to answer is how much parallelism
one can find and exploit on average in simulating a multiprocessor system. The experience with
special-purpose simulation and preliminary results with the general-purpose system are very
encouraging. The mechanism used for the parallel evaluation and the data structures involved
can have a major impact on performance. We are considering alternatives and will evaluate
their behavior and performance.

II.  Automatic  Restructuring  and  Concurrent  Evaluation  of  Lisp 

II.1.  Parcel

Parcel is an investigation of the compilation of sequential Scheme programs for execution on
a shared-memory multiprocessor. The Parcel system has two components: a parallelizing
compiler, which produces a machine-independent parallel intermediate form from a sequential
Scheme program, and a code generator/run-time system, through which the intermediate form is
expanded into machine instructions for the Alliant FX/8 and linked to a parallel run-time
system to form an executable.


II.2.  The  Parcel  Compiler 

The Parcel compiler operates in three phases: interprocedural analysis, standard
transformations, and parallelizing transformations.

Interprocedural analysis in Parcel takes the form of an abstract interpretation of the
program. The abstract domains over which this interpretation takes place are built upon a
mechanism for recording the interprocedural movements made by dynamically allocated objects.
In terms of these movements, the analysis determines the visibility of any side-effects upon
an object, and determines the lifetime of the object, so that it may be placed on a stack or
in a hierarchical memory as its required lifetime permits. This analysis is described in
detail in [Harr89].

Parcel performs a number of standard transformations upon each procedure of the program
prior to parallelization. These transformations are optimizations that are beneficial from the
standpoint of both parallel and sequential execution. For example, procedure integration is
performed because it both eliminates the expense of procedure invocation and makes code motion
during parallelization more effective, by allowing it to be applied to a larger procedure
body. Similarly, invariant floating eliminates needless repetition of an evaluation, and may
at the same time remove spurious control dependences from a procedure.
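As an illustration of the repetition that invariant floating removes, here is a small before/after sketch. It assumes the transformation behaves like ordinary loop-invariant code motion; it is written in Python rather than Scheme, and the function names are hypothetical, not taken from Parcel.

```python
def scale_all_naive(values, base, exponent):
    """Before: the invariant expression is re-evaluated on every iteration."""
    out = []
    for v in values:
        factor = base ** exponent  # does not depend on v
        out.append(v * factor)
    return out


def scale_all_floated(values, base, exponent):
    """After: the invariant expression is evaluated once, outside the loop."""
    factor = base ** exponent
    return [v * factor for v in values]
```

The floated version evaluates `base ** exponent` even when `values` is empty, so a compiler can only apply the rewrite when the floated expression is known to be free of side effects and errors.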

A number of parallelizing transformations tailored specifically to symbolic programs
originated in Parcel. For example, we introduced a quite general means of parallelizing exit
loops (generalized While/Repeat structures) by means of boolean recurrences in conjunction
with loop distribution. A transformation called recursion splitting is used to rewrite
self-recursive procedures as a complementary pair of loops, one of which performs the downward
portion of a recursive computation, the other of which performs the upward portion. These
transformations are described in [Harr89, HaPa88].
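The idea behind recursion splitting can be sketched on a toy self-recursive procedure. This is only an illustration of the downward/upward decomposition described above, written by hand in Python; it is not Parcel's algorithm or output, and the example function is hypothetical.

```python
def sum_squares_recursive(items):
    """Original self-recursive form."""
    if not items:
        return 0
    head, rest = items[0], items[1:]
    return head * head + sum_squares_recursive(rest)


def sum_squares_split(items):
    """Hand-split form: one downward loop followed by one upward loop."""
    # Downward loop: the work done before each recursive call,
    # saving what the upward portion will need.
    pending = []
    rest = items
    while rest:
        head, rest = rest[0], rest[1:]
        pending.append(head * head)

    # Upward loop: the work done after each call returns, innermost call first.
    result = 0
    for term in reversed(pending):
        result = term + result
    return result
```

Once the recursion is expressed as two loops, loop-oriented techniques become applicable: here the squarings in the downward loop are independent of one another, and the upward loop is a simple sum recurrence.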

II.3.  The Parcel Run-time System

From every procedure of a program, Parcel produces two transformed versions, one sequential
and the other parallel. These two versions are presented, in intermediate form, to the code
generator, which expands them into machine code and links them to the run-time system to
form an executable.

The run-time system has, in addition to the usual sequential functionality of a Scheme
run-time system, a microtasking library for the scheduling of parallel activity, a parallel
storage allocator and deallocator (garbage collector), and a library of parallel recurrence
solvers, for recurrences over numeric and symbolic data that are recognized and isolated by
the compiler.

The demands upon the Parcel microtasking library are quite strenuous, for two reasons.
First, the parallel codes produced by the compiler contain deeply nested, often recursive
parallelism; second, the shape and granularity of the parallel computation vary dramatically
from one program to another, and even from one data set to another. It is the responsibility
of the microtasking library to select between parallel and sequential procedure versions to
achieve a sufficient degree of parallelism and load balancing while retaining as much of the
efficiency of the sequential computation as possible. The microtasking system uses more stacks
than processors, breaking the traditional linkage between processors and stacks, for more
flexible and efficient scheduling of parallel threads of activity. The microtasking system of
Parcel is described in [ChHa90].


II.4.  Miprac

Miprac is a successor to Parcel, in which the interprocedural analysis and transformations
developed for Scheme programs in Parcel are fused with older techniques for the analysis and
restructuring of Fortran programs, to yield a flexible tool for the analysis and restructuring
of procedural languages generally. In the past, the intermediate form upon which a
parallelizing compiler operated had a primarily syntactic correspondence to the program being
compiled. For example, the intermediate form of Parafrase and KAP is essentially a syntax tree
of the source program. We propose, however, an intermediate form with a primarily semantic
correspondence to the program being compiled. To this end, we have designed MIL (Miprac
Intermediate Language) to have three types of constructs: control (iteration, condition,
sequence, branch), memory access (read, write, synchronization, file manipulation), and
operations (addition, multiplication, etc.). These three types of constructs correspond
closely to common semantic domains, so that to rewrite a program into MIL is very much like
giving the semantics of the program in terms of prescribed semantic domains. We are
implementing translators from Fortran 77, C, and Scheme into MIL, in order to demonstrate its
flexibility.

Because MIL allows branching, it has a non-compositional semantics (a continuation
semantics). However, in Miprac most transformations apply to NMIL (Normalized MIL), in which
branching has been eliminated by control-flow normalization. NMIL has a compositional
semantics, so that dataflow analysis and transformation of an NMIL program are straightforward
to implement and verify. The control-flow normalization used in Miprac derives from that
introduced in PAF [TaDF88], and is described in [Amma89].

Interprocedural  analysis  in  Miprac  is  an  extension  of  the  analysis  performed  in  Parcel  to 
permit  arithmetic  on  pointer  variables  (generalized  access  in  blocks  of  dynamically  allocated 
storage)  as  well  as  closures  (higher-order  procedures  with  side-effects).  Because  it  is  essentially 
typeless,  MIL  easily  accommodates  programs  written  in  languages  like  C  and  Fortran  in  which 
type  casting  and  outright  type  violations  are  commonplace. 

III.  Parallel Iterative Solver for Sparse Nonsymmetric Linear Systems

Practical iterative solvers for large sparse nonsymmetric linear systems $Ax = b$ usually
require $A$ to satisfy special properties, such as having a positive definite symmetric part
or having all eigenvalues with positive real parts. One class that guarantees convergence for
any nonsingular matrix $A$ is the row projection (RP) methods, iterative algorithms that
partition the rows of $A$ into blocks $A^T = [A_1, A_2, \ldots, A_m]$ and then compute
orthogonal projections onto $\mathrm{range}(A_i)$ $(i = 1, \ldots, m)$ on each iteration. Two
general classes of RP methods have been examined: the product, or Kaczmarz [Kacz39], form and
the sum, or Cimmino [Cimm39], form. Both iterative forms are accelerated using conjugate
gradients, as initially suggested by A. Bjorck and T. Elfving in [BjEl79] and implemented for
two-dimensional problems by C. Kamath and A. Sameh in [Kama86].
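The difference between the product and sum forms can be seen in a minimal sketch. The simplifications here are assumptions made for brevity and are not the implementation reported above: each block $A_i$ is a single row, the iteration parameter is fixed at 1, no conjugate gradient acceleration is applied, and the sum form averages the corrections so that the unaccelerated iteration converges. All function names are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def row_correction(row, rhs, x):
    """Step that moves x orthogonally onto the hyperplane row . x = rhs."""
    scale = (rhs - dot(row, x)) / dot(row, row)
    return [scale * r for r in row]


def kaczmarz_sweep(A, b, x):
    """Product form: apply the row projections one after another."""
    x = list(x)
    for row, rhs in zip(A, b):
        step = row_correction(row, rhs, x)
        x = [xi + si for xi, si in zip(x, step)]
    return x


def cimmino_sweep(A, b, x):
    """Sum form: compute every correction from the same x, then average them."""
    steps = [row_correction(row, rhs, x) for row, rhs in zip(A, b)]
    m = len(A)
    return [xi + sum(s[j] for s in steps) / m for j, xi in enumerate(x)]


def iterate(sweep, A, b, sweeps):
    """Run a fixed number of sweeps starting from the zero vector."""
    x = [0.0] * len(A[0])
    for _ in range(sweeps):
        x = sweep(A, b, x)
    return x
```

In the product form each projection depends on the result of the previous one, whereas in the sum form all corrections within a sweep are independent and could be computed concurrently.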

Several  theoretical  and  numerical  results  have  been  obtained  for  RP  methods  in  the  last 
year.  Among  them  are: 

•  The limit points of RP methods have been characterized, for arbitrary choices of iteration
parameters and block row partitionings [Bram89]. RP methods converge even when the system
is singular or rectangular, and for least squares problems it is important to know what
'solution' the method finds.


•  The choice $\omega = 1$ has been shown to be optimal for the iteration parameter, for both
theoretical and computational reasons [KaSa88, BrSa89a].


•  The underlying connections between the sum and product RP forms and CG applied to the
normal equations have been discovered. This provides a basis for comparing the methods, as
well as one guide for choosing block row partitionings.

•  By combining the properties of orthogonal projectors with those of the CG algorithm, one
projection per CG iteration can be eliminated by using a special choice of the starting vector
[BrSa88].

•  A new RP method has been developed [BrSa89a]. CG acceleration of this system provides an
error-minimizing algorithm, and the new system also allows an explicit reduction in the
problem size.

•  An error analysis of the methods showed that it is essential to compute the projections
accurately. This in turn restricts the types of block row partitionings allowable.

•  An approach to finding row partitionings that allow parallelism in the computations has been
developed [BrSa89b]. This approach is based on a domain decomposition idea, and depends only
on the discretization operator used and not on any regularity of the domain of the PDE.

The last item yields RP method implementations that are frequently faster than other
solvers because, for problems on an $n \times n \times n$ mesh, for example, they provide
$O(n^2)$ parallelism for speed and work on subproblems of size $O(n)$, giving good data
locality.
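As a rough worked instance, assume that the two system orders listed in Table 1 come from $n \times n \times n$ meshes with one unknown per mesh point. This is an assumption; the text does not state $n$. Both orders are exact cubes:

```latex
\[
N = n^3, \qquad 13824 = 24^3, \qquad 216000 = 60^3,
\qquad\text{so}\qquad n^2 = 576 \ \text{and}\ n^2 = 3600 .
\]
```

Under that assumption, $O(n^2)$ parallelism corresponds to on the order of 576 and 3600 independent subproblems respectively, each of size on the order of 24 and 60 unknowns. The constants hidden in the $O(\cdot)$ terms are not given here.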


Problem sizes: N=13824 and N=216000, with test problems 1-6 at each size.

Method   Failure codes recorded (in row order; blank cells omitted)
KACZ     (none)
V-RP     MX MX
CIMM     MX
GMRES    RS RS RS TM
ILU      RS RS RS RS UP UP UP
MILU     RS RS RS RS RS RS UP RS RS UP
CGNE     MX MX MX MX
ILCG     UP MX MX MX UP TM UP TM

Table  1:  Failures  Among  Methods 

RP methods have proven to be more robust numerically than other commonly used nonsymmetric
iterative algorithms, including the theoretically robust method of conjugate gradients (CG)
applied to the normal equations $A^T A x = A^T b$ and Krylov subspace methods. Table 1 shows
the robustness results of testing with six three-dimensional non-selfadjoint elliptic PDEs
from [BrSa89a]; that source should be consulted for the particulars of the equations and
discretizations used. KACZ is the product form RP method, CIMM the sum form, and V-RP the
newly developed RP method. GMRES refers to GMRES(10), CGNE to CG on the normal equations,
(M)ILU to GMRES(10) preconditioned with (M)ILU, and ILCG to an ILU preconditioned CGNE. All
were tested on systems of order N, with N given in the table. Error conditions are given by
the following codes:

•  MX:  Maximum  iterations  reached 

•  UP:  Unstable  preconditioner 

•  RS:  Residual  stagnates 

•  TM:  Maximum  allowed  CPU  time  reached 

Clearly  the  RP  solvers  have  a  robustness  unmatched  by  the  other  methods  tested. 